Hilton International Hotels management needs help to decide on the strategies that will lead to revenue growth.
The solutions to the following questions will help us answer our research question:
You have a dataset containing information on the various chain of hotels, including when customers made the booking, length of stay, the number of adults, children, or babies, and the number of available parking spaces, among other things. Given this dataset, perform data exploration, data cleaning, and analysis to come with appropriate recommendations.
The steps to be taken include: 1. Load dataset and preview its summarized information to get a feel of what you will be working with. 2. Carry out data cleaning. 3. Carry out data analysis. 4. Interpret results. 5. Provide recommendations based on results of analysis. 6. Challenge your solution.
For now, the data we have contains booking data which will be critical for our research specific analysis.
Let’s first load the necessary libraries
library("tidyverse") # to handle manipulation and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("reshape2") #to handle manipulation
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library("countrycode") #to handle country codes
# Load the data below
# ---
# Dataset url =
# ---
# YOUR CODE GOES BELOW
df_holtel <- read_csv("hotel_bookings 2.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): hotel, arrival_date_month, meal, country, market_segment, distrib...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numb...
## date (1): reservation_status_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df_holtel)
# Checking the first 5 rows of data
# ---
# YOUR CODE GOES BELOW
#
head(df_holtel, 5)
# Checking the last 5 rows of data
# ---
# YOUR CODE GOES BELOW
#
tail(df_holtel, 5)
# Sample 10 rows of data
# ---
# YOUR CODE GOES BELOW
#
sample_n(df_holtel, 10)
# Checking number of rows and columns
# ---
# YOUR CODE GOES BELOW
#
dim(df_holtel)
## [1] 119390 32
# Checking datatypes
# ---
# YOUR CODE GOES BELOW
#
glimpse(df_holtel)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
str(df_holtel)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
Record your general observations below: The columns have specific datatypes Some columns contain NULL as a value. This requires cleaning.
The data is originally from the article Hotel Booking Demand Datasets, by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
# Checking datatypes and missing entries of all the variables
# ---
# YOUR CODE GOES BELOW
#
str(df_holtel)
## spec_tbl_df [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : Date[1:119390], format: "2015-07-01" "2015-07-01" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_date(format = "")
## .. )
## - attr(*, "problems")=<externalptr>
# Let's remove the date column to allow NULL replacement
df_holtel <- df_holtel[,!(names(df_holtel) %in% c('reservation_status_date'))]
# we notice company have so many NULL as values we replace them with Na
df_holtel[df_holtel == 'NULL'] <- NA
colSums(is.na(df_holtel))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 488
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 16340
## company days_in_waiting_list
## 112593 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status
## 0
We observe the following from our dataset: Company, country and agent column contains “NULL” as data entries which indicates missing values, converting company NULL to NA shows 112593 missing values Children column have 4 Na values.
# Checking how many duplicate rows are there in the data
# ---
# YOUR CODE GOES BELOW
#
df_holtel[duplicated(df_holtel), ]
#dim(df_holtel[duplicated(df_holtel), ])
We choose to keep the duplicates because we don’t have a unique identifier to actually proof that we have duplicates.
# Checking if any of the columns are all null
# ---
# YOUR CODE GOES BELOW
#
colSums(is.na(df_holtel))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 4 0
## meal country
## 0 488
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 16340
## company days_in_waiting_list
## 112593 0
## customer_type adr
## 0 0
## required_car_parking_spaces total_of_special_requests
## 0 0
## reservation_status
## 0
We observe the following from our dataset: Company has 112593 missing values Country has 488 missing values agent has 16340 missing values
# Checking if any of the rows are all null
# ---
# YOUR CODE GOES BELOW
#
# df with Null values
df_null_holtel <- df_holtel[rowSums(is.na(df_holtel)) > 0,]
dim(df_null_holtel)
## [1] 119173 31
We observe the following from our dataset: There are 119173 rows with missing values.
# Checking the correlation of the features through the use of
# visualizations the correlation using heatmap
# ---
# YOUR CODE GOES BELOW
#
# Create a correlation matrix
corr_matrix <- cor(df_holtel[, unlist(lapply(df_holtel, is.numeric))])
head(corr_matrix)
## is_canceled lead_time arrival_date_year
## is_canceled 1.000000000 0.293123356 0.0166598602
## lead_time 0.293123356 1.000000000 0.0401420998
## arrival_date_year 0.016659860 0.040142100 1.0000000000
## arrival_date_week_number 0.008148065 0.126870813 -0.5405613308
## arrival_date_day_of_month -0.006130079 0.002267553 -0.0002210192
## stays_in_weekend_nights -0.001791078 0.085671133 0.0214973654
## arrival_date_week_number arrival_date_day_of_month
## is_canceled 0.008148065 -0.0061300789
## lead_time 0.126870813 0.0022675527
## arrival_date_year -0.540561331 -0.0002210192
## arrival_date_week_number 1.000000000 0.0668092530
## arrival_date_day_of_month 0.066809253 1.0000000000
## stays_in_weekend_nights 0.018207653 -0.0163542995
## stays_in_weekend_nights stays_in_week_nights
## is_canceled -0.001791078 0.02476463
## lead_time 0.085671133 0.16579936
## arrival_date_year 0.021497365 0.03088330
## arrival_date_week_number 0.018207653 0.01555830
## arrival_date_day_of_month -0.016354300 -0.02817352
## stays_in_weekend_nights 1.000000000 0.49896882
## adults children babies is_repeated_guest
## is_canceled 0.060017213 NA -0.0324910892 -0.084793418
## lead_time 0.119518693 NA -0.0209150163 -0.124409908
## arrival_date_year 0.029635144 NA -0.0131920747 0.010341317
## arrival_date_week_number 0.025909057 NA 0.0103954801 -0.030130758
## arrival_date_day_of_month -0.001565979 NA -0.0002303647 -0.006145021
## stays_in_weekend_nights 0.091871020 NA 0.0184828105 -0.087239379
## previous_cancellations previous_bookings_not_canceled
## is_canceled 0.11013281 -0.0573577232
## lead_time 0.08604180 -0.0735481679
## arrival_date_year -0.11982207 0.0292180512
## arrival_date_week_number 0.03550091 -0.0209035517
## arrival_date_day_of_month -0.02701078 -0.0002997868
## stays_in_weekend_nights -0.01277462 -0.0427152350
## booking_changes days_in_waiting_list adr
## is_canceled -0.1443809911 0.05418582 0.04755660
## lead_time 0.0001488301 0.17008418 -0.06307685
## arrival_date_year 0.0308723496 -0.05649660 0.19758009
## arrival_date_week_number 0.0055075035 0.02293303 0.07579064
## arrival_date_day_of_month 0.0106128560 0.02272754 0.03024519
## stays_in_weekend_nights 0.0632813159 -0.05415111 0.04934191
## required_car_parking_spaces total_of_special_requests
## is_canceled -0.195497817 -0.234657774
## lead_time -0.116450570 -0.095712049
## arrival_date_year -0.013684411 0.108531486
## arrival_date_week_number 0.001920423 0.026149364
## arrival_date_day_of_month 0.008683466 0.003062124
## stays_in_weekend_nights -0.018553809 0.072670830
# Create a table with long form
corr_df <- melt(corr_matrix)
head(corr_df)
# Plot the heatmap
ggplot(corr_df, aes(Var1, Var2, fill = value)) +
geom_tile(color = "black") +
geom_text(
aes(label = round(value, 2)),
color = "white"
) +
coord_fixed() +
labs(
fill="Pearson Correlation"
) +
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
limit = c(-1,1)
) +
theme(
axis.title.x = element_text(angle = 61, vjust = 0.5, hjust=1),
axis.title.y = element_blank()
)
## Warning: Removed 34 rows containing missing values (geom_text).
We observe the following from our dataset:
Increase in lead time increases the chances of a booking cancellation Repeat guest have a lower chance of cancelling a booking. Guests who had previously canceled have a higher likelihood of cancelling again.
# Dropping company column because it has alot of missing values
# and we won't need to answer any of our questions
# ---
# YOUR CODE GOES BELOW
#
df_holtel <- df_holtel[,!(names(df_holtel) %in% c('company'))]
names(df_holtel)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "days_in_waiting_list" "customer_type"
## [27] "adr" "required_car_parking_spaces"
## [29] "total_of_special_requests" "reservation_status"
From the data variable description we see that the Distribution Channel categoricy that tells us about Booking distribution.
The term “TA” means “Travel Agents” The term “TO” means “Tour Operators” This allows us to fill the missing values in the agents column with TO
# We replace the mising values i.e. for TO
# ---
# YOUR GOES BELOW
#
sum(is.na(df_holtel$distribution_channel))
## [1] 0
unique(df_holtel$distribution_channel)
## [1] "Direct" "Corporate" "TA/TO" "Undefined" "GDS"
Distribution Channel doesn’t have any missing value
# We drop rows where there is no adult, baby and child as
# these records won't help us.
# ---
# YOUR GOES BELOW
#
sum(is.na(df_holtel$adults))
## [1] 0
This analysis conflicts with below one, since we have children with missing value only we will skip this
# We replace missing children values with rounded mean value
# ---
# Hint i.e. use round()
# ---
# YOUR GOES BELOW
#
df_holtel$children[is.na(df_holtel$children)] <- round(mean(df_holtel$children, na.rm = TRUE))
sum(is.na(df_holtel$children))
## [1] 0
# Checking for missing values in the dataframe
# ---
# YOUR GOES BELOW
#
#droppin na rows
df_holtel <- na.omit(df_holtel)
colSums(is.na(df_holtel))
## hotel is_canceled
## 0 0
## lead_time arrival_date_year
## 0 0
## arrival_date_month arrival_date_week_number
## 0 0
## arrival_date_day_of_month stays_in_weekend_nights
## 0 0
## stays_in_week_nights adults
## 0 0
## children babies
## 0 0
## meal country
## 0 0
## market_segment distribution_channel
## 0 0
## is_repeated_guest previous_cancellations
## 0 0
## previous_bookings_not_canceled reserved_room_type
## 0 0
## assigned_room_type booking_changes
## 0 0
## deposit_type agent
## 0 0
## days_in_waiting_list customer_type
## 0 0
## adr required_car_parking_spaces
## 0 0
## total_of_special_requests reservation_status
## 0 0
No missing values
# 1. How many bookings were cancelled?
# ---
# Visualisation: Barplot
ggplot(data = df_holtel, aes(x = is_canceled)) +
geom_bar(stat = "count", fill = "steelblue") +
geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
labs(title = "Bookings cancelled",
x = "Cancenclled State",
y = "No. of bookings") +
theme_classic() + scale_color_brewer(palette = "Set2")
# 2. What was the booking ratio between resort hotel and city hotel?
# ---
# Barplot of booking ratio between resort hotel and city hotel
# Visualize the cancellation by hotek type
ggplot(data = df_holtel,
aes(
x = hotel,
y = prop.table(stat(count)),
label = scales::percent(prop.table(stat(count)))
)) +
geom_bar(position = position_dodge(), fill = "steelblue") +
geom_text(
stat = "count",
position = position_dodge(.9),
vjust = -0.5,
size = 3
) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Booking ratio between resort hotel and city hotel",
x = "Hotel Type",
y = "Count(%)") +
theme_classic()
# 3. What was the percentage of booking for each year?
# ---
#
ggplot(data = df_holtel,
aes(
x = arrival_date_year,
y = prop.table(stat(count)),
label = scales::percent(prop.table(stat(count)))
)) +
geom_bar(position = position_dodge(), fill = "steelblue") +
geom_text(
stat = "count",
position = position_dodge(.9),
vjust = -0.5,
size = 3
) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Percentage of booking for each year",
x = "Year",
y = "Count(%)") +
theme_bw()
# 4. Which were the most busiest months for hotels?
# ---
#
# Organize the Month in proper order
df_holtel$arrival_date_month <-
factor(df_holtel$arrival_date_month, levels = month.name)
# Visualize Hotel traffic on Monthly basis
ggplot(data = df_holtel, aes(x = arrival_date_month)) +
geom_bar(fill = "steelblue") +
geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
coord_flip() + labs(title = "Month Wise Booking Request",
x = "Month",
y = "Count") +
theme_classic()
# 5. From which top 3 countries did most guests come from?
# ---
# YOUR GOES BELOW
#
# where are the people coming from
df_holtel_1 <- df_holtel[df_holtel$reservation_status == "Check-Out",]
# Subset the data to include the countries which has more than 1500 reservation request
# otherwise including all the country with few or occassional request to avoid the graph
# from being clumsy
sub_hotel <- df_holtel_1 %>%
group_by(country) %>%
filter(n() > 1500)
# Visualize the Travellor by Country.
sub_hotel$county_name <- countrycode(sub_hotel$country,
origin = "iso3c",
destination = "country.name")
# Traveller by Country per hotel wise
ggplot(sub_hotel, aes(county_name, fill = hotel)) +
geom_bar(stat = "count", position = position_dodge()) +
labs(title = "Booking Status by Country",
x = "Country",
y = "Count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.background = element_blank())
# 6.a) How long do most people stay in hotels?
# b) By city and resort? Separate the data by hotel
# ---
#
# Total Stay Duration
ggplot(sub_hotel, aes(stays_in_weekend_nights + stays_in_week_nights)) +
geom_density(col = "red") +facet_wrap(~hotel) + theme_bw()
# 7. Which was the most booked accommodation type (Single, Couple, Family)?
# ---
#
ggplot(sub_hotel, aes(customer_type, fill = hotel)) +
geom_bar(stat = "count", position = position_dodge()) +
labs(title = "Hotel Preference by Customer Type",
x = "Customer Type",
y = "Count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.background = element_blank())
From the above analysis, below are our recommendations:
In this step, we review our solution and implement approaches that could potentially provide a better outcome. In our case, we could propose the following question that wasn’t answered in our solution because it couldn’t have greatly contributed to our recommendation.
# When should hotels provide special offers?
# ---
# YOUR GOES BELOW
#
#one approach is check cancellation rate by month
ggplot(df_holtel, aes(arrival_date_month, fill = factor(is_canceled))) +
geom_bar() + geom_text(stat = "count", aes(label = ..count..), hjust = 1) +
coord_flip() + scale_fill_discrete(
name = "Booking Status",
breaks = c("0", "1"),
label = c("Cancelled", "Not Cancelled")
) +
labs(title = "Booking Status by Month",
x = "Month",
y = "Count") + theme_bw()
Our observations: - The best time to introduce offers is between May and September. This will reduce cancellation and increase revenue.
How does this observation tie to our solution?
During this step, you rethink and propose other ways that you can improve your solution.
How does the charge of the hotel affect revenue? How does the environment in-terms of scenery affect revenue?
a). Did we have the right data? Yes, for the initial analysis done. b). Do we need other data to answer our question? Yes, competition data to compare charges and customers reviews to analyse sentiments and how such affect revenue. c). Did we have the right question? Yes.